Prepare data / feature engineering

data <- read_data()
# data <- read_data("../data/lgd_dataset.csv")

data$appartment_collateral_mv <- ifelse(
  data$real_estate_type == "appartment",
  data$mortgage_collateral_mv,
  0
)
data$house_collateral_mv <- ifelse(
  data$real_estate_type == "single family house",
  data$mortgage_collateral_mv,
  0
)
data$office_collateral_mv <- ifelse(
  data$real_estate_type == "office building",
  data$mortgage_collateral_mv,
  0
)
data$single_family_house <- ifelse(
  data$real_estate_type == "single family house",
  1,
  0
)
data$lgd_nom <- data$lgd * data$loan_amount
data$lgd_log <- log(data$lgd+1)
data$house_collateral_ratio <- data$house_collateral_mv / data$loan_amount
data$appartment_collateral_ratio <- data$appartment_collateral_mv / data$loan_amount
data$office_collateral_ratio <- data$office_collateral_mv / data$loan_amount
data$additional_collateral_ratio <- data$additional_collateral_mv / data$loan_amount

data$loan_amount_log <- log(data$loan_amount)
data$house_collateral_log <- log(data$house_collateral_mv+1)
data$appartment_collateral_log <- log(data$appartment_collateral_mv+1)
data$office_collateral_log <- log(data$office_collateral_mv+1)
data$additional_collateral_log <- log(data$additional_collateral_mv+1)

data$rr <-  1 - data$lgd
data$rr_nom <- data$rr * data$loan_amount
data$rr_nom_log <- log(data$rr_nom)

data$loan_to_mortg <- data$loan_amount / data$mortgage_collateral_mv
data$loan_to_addit <- data$loan_amount / data$additional_collateral_mv
data$loan_to_colla <- data$loan_amount / (data$mortgage_collateral_mv + data$additional_collateral_mv)

Segmentation

It makes intuitively sense that the dynamics with which loans are provided, repaid and the overall riskyness of a loan is dependent on the loan type. For example, higher risk might be associated with a loan which is used by a business to enter a new market compared to the buying of a house by a private person. Looking at the distribution of the recovery in CHF makes the difference visible (using logs for better visibility):

Therefore, we might want to fit one model for private customers and one model for corporate customers separately.


Ad hoc manipulation of the predicted dependent variable

Note that using some model (e.g. linear regression) we might predict negative values for loss given default which - at least in this context - does not make sense. Therefore, after prediction the estimates which are negative are set to zero.


Segment 1: Private Clients

First, let’s subset the dataset and only focus on private clients. Note that the observed loss given default seems to depend on the type of real estate:

Economic intuition: selling conditions for real estate might differ across real estate types. Single family houses might to be more difficult to sell - or more generally can be sold in case of default with a higher loss.

For this reason, the market value per real estate type is treated differently, i.e. for each real estate category one column is added containing the market value if the corresponding contract is on the given real estate type and zero otherwise, same of additional collateral:

real_estate_type house_collateral_mv appartment_collateral_mv
736 single family house 1636145 0.0
827 single family house 1766339 0.0
719 single family house 1521278 0.0
274 appartment 0 1444293.6
648 single family house 1784238 0.0
519 appartment 0 920868.9

If we consider nominal values, there should be a positive relationship between the market values of the collateral types and the nominal loss given default (resp. the recovery in CHF):

Hence, the market values could be a good predictor of the loss given default in CHF, which then can be converted to the loss given default in % by dividing with the loan amount.


Model 1

The first model uses the corresponding market values of the collateral to predict the loss given default in CHF:

\[ \text{nominal LGD}_i = \beta_0 + \beta_1*\text{MV appartment}_i + \beta_2*\text{MV house}_i + \beta_3*\text{MV additional}_i + \epsilon_i \]

Summary of this model:


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-248109  -19027   -5268    3369  611394 

Coefficients:
                            Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)              -6620.18795  4670.52353  -1.417   0.15673    
loan_amount                  0.22713     0.05127   4.430 0.0000107 ***
appartment_collateral_mv    -0.16534     0.04010  -4.123 0.0000411 ***
house_collateral_mv         -0.13249     0.04015  -3.300   0.00101 ** 
additional_collateral_mv    -0.09204     0.05156  -1.785   0.07461 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 68200 on 837 degrees of freedom
Multiple R-squared:  0.2508,    Adjusted R-squared:  0.2472 
F-statistic: 70.06 on 4 and 837 DF,  p-value: < 0.00000000000000022

This models RMSE score is 47349.3 CHF resp. 4.79%.

Following plot shows the difference between the realized LGD values and the corresponding predictions:

Note that the predictions for single family houses (on the RHS of vertical line) are systematically worse than for apartments:

We might to look deeper into this difference by real estate types.

Overall, we estimate a loss of 23.07 mio. CHF while the observed loss is 22.64 mio. CHF, hence we overestimate the loss by 0.43 mio. CHF what we might prefer from a risk management point of view (compared to underestimate the loss).


Model 2

Note the distribution of the target variable

Maybe converting the target variable to logs and hence also the predictors such that they have similar scaling might lead to better performance. Note that the target variable includes a lot of zeros, which result in -Inf when converting to logs which cannot be used for regression. Therefore, the loss given default is converted to the recovery rate resp. the recovery in CHF before taking the log, leading to following distribution:

Unfortunately, there is no economic valid way to transform the market value of collaterals such that we do not have any zeros anymore. Therefore, one CHF is added to all market values, resulting in a negligible difference if the market value is positive as the values are quite large and zero for no collateral since \(log(1) = 0\).

The model which is fitted now:

\[ \text{log(nominal recovery)}_i = \beta_0 + \beta_1*\text{log(MV appartment)}_i + \beta_2*\text{log(MV house)}_i + \beta_3*\text{log(MV additional)}_i + \epsilon_i \]


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.34719  0.00238  0.01203  0.01947  0.07834 

Coefficients:
                           Estimate Std. Error t value             Pr(>|t|)    
(Intercept)               0.0127381  0.0480673   0.265             0.791070    
loan_amount_log           0.8353316  0.0441020  18.941 < 0.0000000000000002 ***
appartment_collateral_log 0.1597596  0.0437136   3.655             0.000274 ***
house_collateral_log      0.1568866  0.0437039   3.590             0.000350 ***
additional_collateral_log 0.0012379  0.0005057   2.448             0.014565 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.054 on 837 degrees of freedom
Multiple R-squared:  0.9912,    Adjusted R-squared:  0.9912 
F-statistic: 2.365e+04 on 4 and 837 DF,  p-value: < 0.00000000000000022

[1] -1209921
[1] "we expect 23.851894932812 mio CHF loss. Real loss was 22.6419735018877"

Overall, we estimate a loss of 23.85 mio. CHF while the observed loss is 22.64 mio. CHF, hence we overestimate the loss by 1.21 mio. CHF what we might prefer from a risk management point of view (compared to underestimate the loss). However, the first model also overestimates the loss but is still more accurate. Therefore, we might prefer model 1 over model 2.

Model 3

Note that the response variable LGD cannot be negative and therefore is bounded. There might be an underlying linear model but the observed LGD values are truncated:

\[ \text{LGD}_i = \max(\widehat{LGD}_i, 0) \]

Therefore, let’s try tobit regression


Call:
AER::tobit(formula = lgd ~ appartment_collateral_ratio + house_collateral_ratio + 
    additional_collateral_ratio, left = 0, right = 1, dist = "logistic", 
    data = segment_private)

Observations:
         Total  Left-censored     Uncensored Right-censored 
           842            617            225              0 

Coefficients:
                            Estimate Std. Error z value             Pr(>|z|)
(Intercept)                  0.93431    0.14793   6.316     0.00000000026894
appartment_collateral_ratio -0.81429    0.11954  -6.812     0.00000000000962
house_collateral_ratio      -0.72907    0.11746  -6.207     0.00000000053971
additional_collateral_ratio -0.78708    0.14184  -5.549     0.00000002873278
Log(scale)                  -2.74016    0.06004 -45.639 < 0.0000000000000002
                               
(Intercept)                 ***
appartment_collateral_ratio ***
house_collateral_ratio      ***
additional_collateral_ratio ***
Log(scale)                  ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Scale: 0.06456 

Logistic distribution
Number of Newton-Raphson Iterations: 4 
Log-likelihood: -85.26 on 5 Df
Wald-statistic: 100.8 on 3 Df, p-value: < 0.000000000000000222 

Note: Tobit predicts what the result ought to be in the absence of the truncation. Therefore, convert negative estimates to 0.

Although the tobit model seems to predict more accurately, the problem is that using the tobit model we estimate a loss of 4.75 mio. CHF while the observed loss is 22.64 mio. CHF, hence we highly underestimate the loss by 17.89 mio.

We might change the lower bound of our estimations, i.e. we do not predict an LGD below a certain value. To still have a data driven decision process, we might set this lower bound to the mean of observed LGD per real estate type:

Now we estimate a loss of 21.79 mio. CHF while the observed loss is 22.64 mio. CHF, hence we still underestimate the loss by 0.85 mio. We might switch back to simple linear regression.

Model 4

Note that the predicted values for single family house is much worse than for apartments. Therefore, one might fit a model on the subset for apartment and subset for single family house separately:

[1] 67651.11

Still the same problem…

[1] -284405.2
[1] "we expect 22.926378746522 mio CHF loss. Real loss was 22.6419735018877"

We still have the same problem…


Segment 2: Corporate Clients

First, note that the only real estate type of corporate clients is office building. Therefore, the positions only vary in loan amount, the market value of the collateral and whether the client provides additional collateral and its corresponding market value. Therefore, we need might further logic to segment corporate clients more granular.

Looking at the ratio \(\frac{\text{loan amount}}{\text{collateral}}\), we can see that not all clients must provide proportionally the same value of collateral.

Economic intuition: due to client specific factors, there could be higher or lower risk associated with the position, which require more or less provided collateral.

mean lgd for left tail: 0.0856024

mean lgd for right tail: 0.2684521

mean lgd over whole sample:0.1388319

Given this logic, there should be a positive relationship between loan to collateral ratio and loss given default:

Model 1

First try simple model of regressing on loan amount and market value of collateral types on LGD in CHF:

\[ \text{nominal LGD}_i = \beta_0 + \beta_1*\text{loan amount}_i + \beta_2*\text{MV office}_i + \beta_3*\text{MV additional}_i + \epsilon_i \]


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-15620458  -1092449   -237547    582744  20755362 

Coefficients:
                            Estimate  Std. Error t value            Pr(>|t|)
(Intercept)               72141.3778 147286.3091   0.490               0.624
loan_amount                   1.7057      0.1513  11.276 <0.0000000000000002
office_collateral_mv         -1.2218      0.1174 -10.406 <0.0000000000000002
additional_collateral_mv     -1.3694      0.1557  -8.797 <0.0000000000000002
                            
(Intercept)                 
loan_amount              ***
office_collateral_mv     ***
additional_collateral_mv ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2732000 on 607 degrees of freedom
Multiple R-squared:  0.4135,    Adjusted R-squared:  0.4106 
F-statistic: 142.7 on 3 and 607 DF,  p-value: < 0.00000000000000022

Root Mean Squared Error:

[1] 0.1558712

The prediction performance is not really sufficient… Let’s analyze if we can find a relationship between the prediction error and some variables.

Look at the relationship between the loan to collateral ratio and the prediction error:

It seems that when increasing the loan to collateral ratio, we tend to overestimate the loss given default. Ideas to address this:

  • discreet loan to collateral to dummy variables
  • segment data set according to loan to collateral and fit model separately

This makes economically sense:

If \(\frac{\text{loan amount}}{\text{collateral}}\) higher, we need less collateral to cover the loan, meaning the corresponding client might be associated with lower risk, hence the expected loss given default should be lower (ceteris paribus). If we do not account for this relationship, we ignore the fact that the client is of lower risk and therefore has a lower loss given default.

Differentiate by collateral type:

Model 2

Using this logic, let’s first just run a regression of loan to collateral on LGD:

\[ \text{LGD}_i = \beta_0 + \beta_1*\text{Loan to Collateral}_i + \epsilon_i \]


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18465 -0.12691 -0.03866  0.09584  0.51874 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)   
(Intercept)         -0.3636     0.1522  -2.389  0.01718 * 
loan_to_collateral   0.6397     0.1936   3.304  0.00101 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1484 on 609 degrees of freedom
Multiple R-squared:  0.01761,   Adjusted R-squared:  0.016 
F-statistic: 10.92 on 1 and 609 DF,  p-value: 0.001008

Root Mean Squared Error:

[1] 0.1483979

The fit is not really that much better…

Model 3

Now “combine” model 1 and 2 by regressing LGD on loan to collateral, office collateral to loan ratio and the additional collateral to loan ratio:

\[ \text{LGD}_i = \beta_0 + \beta_1*\text{Loan to Collateral}_i +\beta_2*\text{MV Office to Loan}_i + \beta_3*\text{MV additional to Loan}_i + \epsilon_i \]


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18796 -0.12687 -0.03173  0.09724  0.50694 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                  -1.4743     4.4002  -0.335    0.738
loan_to_collateral            1.3442     2.8354   0.474    0.636
office_collateral_ratio       0.4463     1.7056   0.262    0.794
additional_collateral_ratio   0.2245     1.7204   0.131    0.896

Residual standard error: 0.148 on 607 degrees of freedom
Multiple R-squared:  0.026, Adjusted R-squared:  0.02118 
F-statistic: 5.401 on 3 and 607 DF,  p-value: 0.001133

Root Mean Squared Error:

[1] 0.1478913

Model 4

Maybe we get better performance if we further segmentize our sample. Let’s divide our data set by loan amount (high, medium and low loans):

threshold_high <- 0.90
threshold_low <- 0.75
segment_corporate_high <- segment_corporate[segment_corporate$loan_amount >= quantile(segment_corporate$loan_amount, threshold_high),]
segment_corporate_medium <- segment_corporate[segment_corporate$loan_amount >= quantile(segment_corporate$loan_amount, threshold_low) & segment_corporate$loan_amount < quantile(segment_corporate$loan_amount, threshold_high),]
segment_corporate_low <- segment_corporate[segment_corporate$loan_amount < quantile(segment_corporate$loan_amount, threshold_low),]
hist(segment_corporate$loan_amount, breaks = 100)
abline(v = quantile(segment_corporate$loan_amount, threshold_high))
abline(v = quantile(segment_corporate$loan_amount, threshold_low))

Fit model 3 on each subset separately:

\[ \text{LGD}_i = \beta_0 + \beta_1*\text{Loan to Collateral}_i +\beta_2*\text{MV Office to Loan}_i + \beta_3*\text{MV additional to Loan}_i + \epsilon_i \]

Dependent variable:
regressors
highest high regular
(1) (2) (3)
loan_amount 2.523*** 0.171 0.157
(0.444) (0.393) (0.175)
office_collateral_mv -1.896*** 0.023 0.013
(0.343) (0.305) (0.137)
additional_collateral_mv -2.024*** -0.190 -0.411**
(0.457) (0.405) (0.173)
Constant 3,018,068.000 -1,255,298.000 -57,557.440
(1,854,473.000) (2,100,442.000) (114,372.200)
Observations 62 91 458
R2 0.378 0.047 0.226
Adjusted R2 0.346 0.014 0.221
Residual Std. Error 6,338,295.000 (df = 58) 3,299,786.000 (df = 87) 1,207,234.000 (df = 454)
F Statistic 11.745*** (df = 3; 58) 1.425 (df = 3; 87) 44.206*** (df = 3; 454)
Note: p<0.1; p<0.05; p<0.01

RMSE for high segment: 12.97% resp. 8598379.84 CHF
RMSE for medium segment: 15.59% resp. 2770251.67 CHF
RMSE for low segment: 15.04% resp. 1198669.42 CHF

Not sufficient…

Model 5

Let’s use the logic we derived earlier and segmentize the data set in low, medium and high risk clients:

segment_corporate_high <- segment_corporate[segment_corporate$loan_to_collateral > mu+sigma,]
segment_corporate_medium <- segment_corporate[segment_corporate$loan_to_collateral <= sigma+mu & segment_corporate$loan_to_collateral >= mu-sigma,]
segment_corporate_low <- segment_corporate[segment_corporate$loan_to_collateral < mu-sigma,]

fit model for high loan to collateral (lower risk)


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-6748572  -667852   221447   840999  6984018 

Coefficients:
                             Estimate   Std. Error t value Pr(>|t|)  
(Intercept)              -646387.2240  283041.0449  -2.284   0.0247 *
loan_amount                    0.4975       1.2268   0.406   0.6861  
office_collateral_mv          -0.2078       1.0086  -0.206   0.8373  
additional_collateral_mv      -1.0614       1.1861  -0.895   0.3733  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1990000 on 91 degrees of freedom
Multiple R-squared:  0.7433,    Adjusted R-squared:  0.7348 
F-statistic: 87.83 on 3 and 91 DF,  p-value: < 0.00000000000000022

RMSE:

[1] 0.1756801

fit model for medium loan to collateral (medium risk)


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-14766729   -958151   -180448    676154  19011234 

Coefficients:
                            Estimate  Std. Error t value          Pr(>|t|)    
(Intercept)              -55152.3550 187763.9106  -0.294             0.769    
loan_amount                   2.7166      0.3636   7.471 0.000000000000476 ***
office_collateral_mv         -2.0012      0.2856  -7.006 0.000000000010005 ***
additional_collateral_mv     -2.2572      0.3062  -7.371 0.000000000000926 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2804000 on 414 degrees of freedom
Multiple R-squared:  0.429, Adjusted R-squared:  0.4248 
F-statistic: 103.7 on 3 and 414 DF,  p-value: < 0.00000000000000022

RMSE:

[1] 0.152575

fit model for low loan to collateral (higher risk)


Call:
lm(formula = reformulate(regressors, dependent_variable), data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-3999076 -1245613  -875950    54113 10544572 

Coefficients:
                             Estimate   Std. Error t value Pr(>|t|)   
(Intercept)              1065542.4459  326731.7308   3.261  0.00155 **
loan_amount                   -0.5902       0.5159  -1.144  0.25551   
office_collateral_mv           0.4539       0.3839   1.182  0.24008   
additional_collateral_mv       0.1962       0.4062   0.483  0.63018   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2411000 on 94 degrees of freedom
Multiple R-squared:  0.04529,   Adjusted R-squared:  0.01482 
F-statistic: 1.486 on 3 and 94 DF,  p-value: 0.2233

RMSE:

[1] 0.2730157